14 research outputs found

    The Estonian module of the lexicographic software Sketch Engine (Leksikograafilise tarkvara Sketch Engine eesti keele moodul)

    In autumn 2010, the Institute of the Estonian Language, together with the company Lexical Computing Ltd., began developing an Estonian module for the lexicographic software Sketch Engine (Kilgarriff et al. 2004). The article describes the program's main functions. The Word Sketch function (Estonian: sõnavisand) is covered in more detail. The article introduces the principles for compiling the word sketch grammar, examines separately the syntagmatic relations (i.e. grammatical and lexical collocations) presented in the word sketches of nouns, adjectives and verbs, and discusses options for developing the module further. In addition, it analyses to what extent word sketches can be used for identifying verb sentence patterns

    State-of-the-art on monolingual lexicography for Estonia

    The paper describes the state of the art of monolingual lexicography in Estonia. Firstly, we describe the current situation in Estonia and the main public functions performed by the Institute of the Estonian Language. Secondly, we provide an overview of the primary types of monolingual academic dictionaries (dictionaries of Standard Estonian and explanatory dictionaries) published in Estonia since the 20th century. Monolingual learner’s lexicography has emerged as a new field in the 2010s, focusing on basic vocabulary and collocations. Thirdly, we give a short overview of accessibility policy and availability of language resources for Estonian. Finally, we envisage the future work in the field of lexicography in the Institute. Within the framework of the new dictionary writing system Ekilex the Institute is moving away from presenting separate interfaces for different dictionaries towards a unified data model in order to provide the data in the aggregated form

    D3.8 Lexical-semantic analytics for NLP

    UIDB/03213/2020 UIDP/03213/2020. The present document illustrates the work carried out in task 3.3 (work package 3) of the ELEXIS project, focused on lexical-semantic analytics for Natural Language Processing (NLP). This task aims at computing analytics for lexical-semantic information such as words, senses and domains in the available resources, and at investigating their role in NLP applications. Specifically, the task concentrates on three research directions: i) sense clustering, in which grouping senses by semantic similarity improves the performance of NLP tasks such as Word Sense Disambiguation (WSD); ii) domain labeling of text, in which the lexicographic resources made available by the ELEXIS project for research purposes allow better performance to be achieved; and iii) analysing the diachronic distribution of senses, for which a software package is made available.

    An insight into lexicographic practices in Europe: Results of the extended ELEXIS Survey on User Needs

    The paper presents the results of a survey on lexicographic practices and lexicographers’ needs across Europe that was conducted in the context of the Horizon 2020 project European Lexicographic Infrastructure (ELEXIS) among the observer institutions of the project. The survey is a revised and upgraded version of the survey which was originally conducted among ELEXIS lexicographic partner institutions in 2018 (Kallas et al. 2019a). The main goal of this new survey was to complement the data from the ELEXIS lexicographic partner institutions in order to get a more complete picture of lexicographic practices both for born-digital and retro-digitised resources in Europe. The results offer a detailed insight into many aspects of the lexicographic process at European institutions, such as funding, training, staff, lexicographic expertise, software and tools. In addition, the survey reflects on current trends in lexicography and reveals what institutions see as the most important emerging trends that will affect lexicography in the short-term and long-term future. Overall, the results provide valuable input informing the development of tools, resources, guidelines and training materials within ELEXIS

    The EKI Combined Dictionary 2022 (ELEXIS)

    Eesti Keele Ühendsõnastik 2022 (EKI Combined Dictionary 2022) displays information from different lexical databases: "The Dictionary of Estonian 2019", "Estonian Collocations Dictionary 2019", "Basic Estonian Dictionary" (2014), "The Estonian Morphological Database of the Institute of the Estonian Language 2022". It also displays information from bilingual lexical databases: "Estonian-Russian orthographic dictionary for students 2018" (1st edition 2011), "Estonian-Russian Dictionary 2018" (1st edition 1997–2009), "The Russian Morphological Database of the Institute of the Estonian Language 2022". The data is stored in Ekilex's PostgreSQL database and is accessible through an API. Ekilex is the in-house dictionary writing system (DWS) of the Institute of the Estonian Language. Ekilex is hosted in the Estonian Scientific Computing Infrastructure (ETAIS) cloud. See also: https://doi.org/10.15155/3-00-0000-0000-0000-08C0A

    Parallel sense-annotated corpus ELEXIS-WSD 1.0

    ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.0 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene. The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words and fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language. The sentences were tokenized, lemmatized, and tagged with POS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses in the sense inventory selected for the language (see below) or in BabelNet. Sense inventories were also updated with new senses during annotation.
    List of sense inventories:
    BG: Dictionary of Bulgarian
    DA: DanNet – The Danish WordNet
    EN: Open English WordNet
    ES: Spanish Wiktionary
    ET: The EKI Combined Dictionary of Estonian
    HU: The Explanatory Dictionary of the Hungarian Language
    IT: PSC + Italian WordNet
    NL: Open Dutch WordNet
    PT: Portuguese Academy Dictionary (DACL)
    SL: Digital Dictionary Database of Slovene

    The corpus is available in a CONLL-like tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its whitespace information (whether the token is followed by a whitespace or not), the ID of the sense assigned to the token, and the index of the multiword expression (if the token is part of an annotated multiword expression).

    Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt
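    The seven-column layout described for ELEXIS-WSD can be read with a few lines of Python. The following is a minimal sketch, not the project's own tooling: the `Token` class, the `parse_sentences` function and the field names are hypothetical, and the sketch assumes (the abstract does not say) that sentences are separated by blank lines, that lines starting with `#` are comments, and that an empty cell or `_` marks a missing sense ID or multiword index. The corpus's 00README.txt should be checked for the actual conventions.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class Token:
    """One row of the tab-separated file (field names are hypothetical)."""
    token_id: str
    form: str
    lemma: str
    upos: str
    whitespace: str            # kept as a raw string; its encoding is not assumed
    sense_id: Optional[str]    # None when no sense is assigned
    mwe_index: Optional[str]   # None when the token is not part of an annotated MWE


def _none_if_empty(value: str) -> Optional[str]:
    # Assumption: an empty cell or "_" stands for "no value".
    return None if value in ("", "_") else value


def parse_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of Token, splitting on blank lines (assumed)."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Assumption: comment or metadata line, skipped.
            continue
        cols = line.split("\t")
        cols += [""] * (7 - len(cols))  # pad rows that have fewer than 7 columns
        sentence.append(
            Token(
                token_id=cols[0],
                form=cols[1],
                lemma=cols[2],
                upos=cols[3],
                whitespace=cols[4],
                sense_id=_none_if_empty(cols[5]),
                mwe_index=_none_if_empty(cols[6]),
            )
        )
    if sentence:
        yield sentence
```

    A file can be passed in directly, for example `with open(path, encoding="utf-8") as f: sentences = list(parse_sentences(f))`, where `path` points to one of the per-language corpus files. Collecting the `sense_id` values of all tokens with a given lemma then shows which senses from the inventory actually occur in the annotations.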